CRUXEval-input

p-values

The null hypothesis is that models A and B each have a 1/2 chance of winning whenever they disagree; ties are discarded. The p-value is the probability, under the null hypothesis, of observing a difference at least as extreme as the one actually observed. Hover over each entry to display the information used to compute the p-values.
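
This is a sign test restricted to the examples where the two models disagree. A minimal sketch, assuming per-example 0/1 correctness vectors for the two models (the function name and data layout are hypothetical, not the exact code behind the table):

```python
# Sign-test p-value on the examples where the two models disagree (sketch).
from scipy.stats import binomtest

def pairwise_p_value(correct_a, correct_b):
    """Two-sided p-value under the null that A and B each win a
    disagreement with probability 1/2; ties are discarded."""
    wins_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    wins_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n = wins_a + wins_b  # examples where the two models differ
    if n == 0:
        return 1.0  # the models never disagree
    return binomtest(wins_a, n, p=0.5, alternative="two-sided").pvalue
```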

Typical accuracy delta needed for good p-values

We can also find the typical p-value for a typical difference in accuracy. Hover over a point to display the actual model pairs behind it.
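
As a back-of-the-envelope sketch (the benchmark size is CRUXEval's 800 examples, but the disagreement rate is an assumed value, not the plot's data), we can sweep the accuracy delta and see where the sign test reaches p < 0.05:

```python
# How big an accuracy delta must be before the sign test gives a small p-value.
from scipy.stats import binomtest

N = 800          # CRUXEval has 800 examples
disagree = 0.30  # assumed fraction of examples where the two models differ
n = round(N * disagree)  # disagreements contributing to the sign test

for delta in [0.01, 0.02, 0.03, 0.05, 0.08]:
    gap = round(N * delta)   # A needs this many more wins than B
    wins_a = (n + gap) // 2
    p = binomtest(wins_a, n, p=0.5).pvalue
    print(f"delta={delta:.2f}  p={p:.3f}")
```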

Pairwise wins (including ties)

Following Chatbot Arena, these are the head-to-head comparisons between all pairs of models, reporting wins and two types of ties (both models correct, or both incorrect).
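
A minimal sketch of the tally, again assuming 0/1 correctness vectors per model (the data layout is hypothetical):

```python
# Head-to-head counts: A-wins, B-wins, and the two tie types.
from itertools import combinations

def head_to_head(results):
    """results: dict mapping model name -> list of 0/1 per-example correctness."""
    table = {}
    for a, b in combinations(results, 2):
        wins_a = wins_b = tie_correct = tie_wrong = 0
        for x, y in zip(results[a], results[b]):
            if x and not y:
                wins_a += 1
            elif y and not x:
                wins_b += 1
            elif x and y:
                tie_correct += 1   # both models correct
            else:
                tie_wrong += 1     # both models incorrect
        table[(a, b)] = (wins_a, wins_b, tie_correct, tie_wrong)
    return table
```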

Result table

We show three methods currently used for evaluating code models: raw accuracy (pass@1), as reported by benchmarks; average win rate over all other models; and Elo (technically Bradley-Terry coefficients, following Chatbot Arena; a fitting sketch follows the table). These usually have near-perfect correlation.

rank model pass@1 win_rate elo
0 gpt-4-0613+cot 0.755 0.952 1540.237
1 gpt-4-turbo-2024-04-09+cot 0.757 0.878 1380.317
2 gpt-3.5-turbo-0613+cot 0.503 0.790 1259.673
3 gpt-4-0613 0.698 0.743 1203.368
4 claude-3-opus-20240229+cot 0.734 0.714 1174.246
5 gpt-4-turbo-2024-04-09 0.685 0.712 1174.064
6 codellama-34b+cot 0.501 0.709 1173.731
7 codellama-13b+cot 0.474 0.646 1115.937
8 claude-3-opus-20240229 0.642 0.593 1068.591
9 codellama-7b+cot 0.404 0.587 1064.422
10 codetulu-2-34b 0.492 0.570 1049.250
11 codellama-34b 0.472 0.555 1036.052
12 deepseek-base-33b 0.465 0.537 1022.381
13 deepseek-instruct-33b 0.465 0.512 1002.395
14 gpt-3.5-turbo-0613 0.490 0.508 1000.000
15 codellama-python-34b 0.439 0.507 998.455
16 phind 0.472 0.500 993.536
17 codellama-13b 0.425 0.496 989.942
18 deepseek-base-6.7b 0.419 0.492 985.885
19 mixtral-8x7b 0.393 0.466 965.541
20 codellama-python-13b 0.397 0.466 965.146
21 magicoder-ds-7b 0.417 0.433 939.551
22 wizard-34b 0.427 0.429 937.456
23 codellama-python-7b 0.373 0.399 912.804
24 codellama-7b 0.360 0.386 901.755
25 mistral-7b 0.350 0.376 894.167
26 deepseek-instruct-6.7b 0.374 0.355 877.088
27 phi-2 0.316 0.352 873.544
28 wizard-13b 0.365 0.352 874.113
29 starcoderbase-16b 0.313 0.339 863.138
30 starcoderbase-7b 0.297 0.291 821.371
31 phi-1.5 0.232 0.274 806.429
32 deepseek-base-1.3b 0.278 0.251 783.953
33 deepseek-instruct-1.3b 0.272 0.242 774.527
34 phi-1 0.131 0.086 556.104
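
A minimal sketch of the Bradley-Terry fit in the Chatbot Arena style (an assumption about the recipe, not the exact code behind the table): encode each non-tied comparison as a ±1 design row, fit a no-intercept logistic regression, rescale log-odds to the conventional Elo scale, and shift so the anchor model lands at 1000 (the table pins gpt-3.5-turbo-0613 at exactly 1000.000).

```python
# Bradley-Terry coefficients via logistic regression (sketch; the battle
# format and anchoring choice are assumptions).
import numpy as np
from sklearn.linear_model import LogisticRegression

def bradley_terry_elo(battles, models, anchor):
    """battles: list of (model_a, model_b, a_won) with a_won in {0, 1};
    ties are dropped before calling this."""
    idx = {m: i for i, m in enumerate(models)}
    X = np.zeros((len(battles), len(models)))
    y = np.zeros(len(battles))
    for r, (a, b, a_won) in enumerate(battles):
        X[r, idx[a]] = +1.0   # +1 in model_a's column
        X[r, idx[b]] = -1.0   # -1 in model_b's column
        y[r] = a_won
    # Large C approximates an unregularized maximum-likelihood fit.
    lr = LogisticRegression(fit_intercept=False, C=1e6).fit(X, y)
    scale = 400.0 / np.log(10.0)  # log-odds -> conventional Elo scale
    raw = {m: scale * c for m, c in zip(models, lr.coef_[0])}
    shift = 1000.0 - raw[anchor]  # pin the anchor model at 1000
    return {m: v + shift for m, v in raw.items()}
```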